
    Analytical cost metrics: days of future past

    Summer 2019. Includes bibliographical references.
    Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators: special-purpose hardware that increases the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones, and other media devices. With Moore's law driving the evolution of hardware platforms toward exascale, the dominant performance metric (time efficiency) has expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory accesses, and silicon area. These models are used to predict application performance, for performance tuning, and for chip design. The idea is to work with domain-specific accelerators, where analytical cost models can be applied accurately for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in several ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy. The models are used for performance tuning on existing architectures, and are coupled with silicon-area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are built and used to minimize the total data-movement cost of a minimum-operation-count tree.
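    The minimum-operation-count tree mentioned above is found by the textbook matrix-chain dynamic program. The sketch below is that classic algorithm only, not the thesis's data-movement model, which is layered on top of such a tree:

    ```python
    def matrix_chain_min_ops(dims):
        """Minimum scalar multiplications to evaluate a matrix chain.
        Matrix i has shape dims[i] x dims[i+1]."""
        n = len(dims) - 1                       # number of matrices
        cost = [[0] * n for _ in range(n)]      # cost[i][j]: best cost for chain i..j
        for span in range(1, n):                # chain length minus one
            for i in range(n - span):
                j = i + span
                cost[i][j] = min(
                    cost[i][k] + cost[k + 1][j]
                    + dims[i] * dims[k + 1] * dims[j + 1]  # cost of the final multiply
                    for k in range(i, j)
                )
        return cost[0][n - 1]
    ```

    For example, for shapes (10x30)(30x5)(5x60) the program picks (AB)C, needing 4500 multiplications rather than 27000 for A(BC).
    
    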

    BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

    Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs): single-entry, single-exit code blocks that compilers use to break a large code into manageable pieces for analysis. We extrapolate the basic block execution counts of GPU applications and use them to predict performance for large input sizes from the counts observed at smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets, and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively.
    Additionally, to address the utilization of the main functional units in Ampere-architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve average errors of 2.3% and 10.66% for the ALU and FMA units, while the maximum observed error across all tested applications and units reaches 18.5%.
    Comment: Accepted at the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023).
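    The extrapolation idea rests on basic-block counts growing as simple functions of input size. As a minimal sketch, and not the paper's PNN, the snippet below fits an ordinary least-squares line to fabricated counts of a hypothetical block that executes once per element plus two loop-setup executions, then extrapolates to a larger input:

    ```python
    def fit_line(xs, ys):
        """Ordinary least-squares fit of y = a*x + b."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
        return a, my - a * mx

    # Fabricated profile: count = input size + 2 (illustration only).
    sizes  = [64, 128, 256, 512]
    counts = [66, 130, 258, 514]
    a, b = fit_line(sizes, counts)
    predicted = a * 4096 + b   # extrapolated count for a much larger input
    ```

    Real basic-block counts can be polynomial or data-dependent in the inputs, which is why the paper trains a learned model instead of a fixed functional form.
    
    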

    AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations to achieve high performance, while accounting for the complexity of the GPU architecture and memory hierarchy, is difficult. We propose AN5D, an automated stencil framework capable of automatically transforming and optimizing stencil patterns in a given C source code and generating the corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure compared to existing implementations, allowing performance to scale up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
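    Temporal blocking trades redundant halo computation for fewer off-chip reads: each tile is loaded once with a halo and advanced several sweeps locally before writing back. The sketch below shows the data movement for a 1-D 3-point stencil in plain Python; it is an illustration of the general technique, not AN5D's generated CUDA:

    ```python
    def sweep(u):
        """One reference sweep of a 3-point averaging stencil; endpoints fixed."""
        return [u[0]] + [(u[i-1] + u[i] + u[i+1]) / 3.0
                         for i in range(1, len(u) - 1)] + [u[-1]]

    def blocked(u, steps, tile=8, degree=2):
        """Degree-`degree` temporal blocking: each tile is read once with a halo
        of `degree` cells, advanced `degree` sweeps in local scratch, and only
        the still-valid core is written back (halo cells go stale each sweep)."""
        n = len(u)
        assert steps % degree == 0
        for _ in range(steps // degree):
            out = u[:]
            for lo in range(1, n - 1, tile):
                hi = min(lo + tile, n - 1)                 # tile updates u[lo:hi]
                s, e = max(lo - degree, 0), min(hi + degree, n)
                local = u[s:e]                             # single read incl. halo
                for _ in range(degree):
                    local = sweep(local)                   # halo becomes stale...
                out[lo:hi] = local[lo - s:hi - s]          # ...so keep only the core
            u = out
        return u
    ```

    On a GPU the `local` array lives in shared memory or registers, so the `degree - 1` intermediate sweeps never touch DRAM; the framework's contribution is keeping the on-chip pressure low enough for this to scale to degree 10.
    
    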

    Spin-Hall effect of light at a tilted polarizer

    We describe the spin-Hall effect of light (as well as the angular Goos-Hänchen effect) at a tilted linear-dichroic plate, such as a usual linear polarizer. Although the spin-Hall effect at a tilted polarizer was previously associated with the geometric spin-Hall effect of light (which was contrasted with the regular spin-Hall effect) [Phys. Rev. Lett. 112, 113902 (2014)], we show that the effect is actually an example of the regular spin-Hall effect that occurs at tilted anisotropic plates [Optica 3, 1039 (2016)]. Moreover, our approach reveals the angular spin-Hall shift, which is absent in the "geometric" approach. We verify our theory experimentally using the method of quantum weak measurements.
    Funding: Air Force Office of Scientific Research (FA9550-14-1-0040); Army Research Office (W911NF-18-1-0358); Core Research for Evolutional Science and Technology (JPMJCR1676); Japan Science and Technology Agency (QLEAP); Japan Society for the Promotion of Science (VS.059.18N); John Templeton Foundation; Science and Engineering Research Board (TAR/2018/000552); Australian Research Council; Science and Engineering Research Board (SERB), India; Asian Office of Aerospace Research and Development (AOARD) (FA2386-18-1-4045)

    Energy Modeling and Optimization for Tiled Nested-Loop Codes

    We develop a methodology for modeling the energy efficiency of tiled nested-loop codes running on a graphics processing unit (GPU) and use it for energy-efficiency optimization. We assume that a highly optimized and parametrized version of a tiled nested-loop code, either written by an expert programmer or automatically produced by a polyhedral compilation tool, is given to us as an input. We then model the energy consumption as an analytical function of a set of parameters characterizing the software and the GPU hardware. Most previous attempts at GPU energy modeling were based on low-level machine models that were then used to model whole programs through simulations, or were analytical models that required low-level details. In contrast, our approach develops analytical models based on (i) machine and architecture parameters, (ii) program size parameters as found in the polyhedral model, and (iii) tiling parameters, such as those chosen by auto- or manual tuners. Our model therefore allows efficient optimization of energy efficiency with respect to a set of parameters of interest. We illustrate the framework on three nested-loop codes: Smith-Waterman, and one-dimensional and two-dimensional Jacobi stencils, and analyze the accuracy of the resulting models. We also show that the models can be used for optimal tile-size selection for energy efficiency. With an optimal choice of model parameters, the RMS error is less than 4%. Two factors allow us to attain this high accuracy. The first is domain specificity: we focus only on tileable nested-loop codes. The second is that we decouple the energy model from a model of the execution time, a known hard problem.
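    The shape of such an optimization can be shown in a few lines. The snippet below is a hypothetical instance, not the paper's model: the constants are invented stand-ins, and the energy function is a toy model for a 2-D Jacobi tile of size T x T, minimized subject to a shared-memory constraint:

    ```python
    # Illustrative constants -- invented stand-ins, not values from the paper.
    E_DRAM = 100e-12     # assumed energy per off-chip element access (J)
    E_OP   = 1e-12       # assumed energy per arithmetic operation (J)
    SMEM   = 48 * 1024   # shared-memory budget per thread block (bytes)

    def energy_per_cell(T, halo=1, ops=5, elem_bytes=8):
        """Modeled energy to update one cell with a T x T tile: compute energy
        plus DRAM traffic (tile + halo, read once) amortized over T*T cells."""
        dram_loads = (T + 2 * halo) ** 2
        return E_OP * ops + E_DRAM * dram_loads / (T * T)

    def best_tile(halo=1, elem_bytes=8):
        """Pick the feasible tile size minimizing the analytical energy model."""
        feasible = [T for T in range(1, 512)
                    if (T + 2 * halo) ** 2 * elem_bytes <= SMEM]
        return min(feasible, key=energy_per_cell)
    ```

    In this toy model larger tiles always amortize the halo better, so the shared-memory constraint binds; a model with more parameters (occupancy, register pressure) can move the optimum into the interior of the feasible set.
    
    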

    Spin-Hall effect and circular birefringence of a uniaxial crystal plate

    We demonstrate theoretically and experimentally the fine lateral circular birefringence of uniaxial crystal plates, an example of the spin-Hall effect of light. We report experimental observations of this effect using polarimetric and quantum-weak-measurement techniques.

    Transformations for Energy Efficient Accelerated Chain Matrix Multiplication (TEE-ACM 2)

    GPU matrix chain multiplication serves as a basis for a wide range of scientific domains, such as computer graphics, physics, and machine learning. While its time performance has been studied for years, there has been significantly less effort in optimizing its energy efficiency. GPU power consumption is heavily impacted by the number of data transfers performed: a data transfer from global memory needs a thousand times more energy than a double-precision arithmetic operation. Thus, minimizing data transfers is key to reducing energy consumption. We present an energy-efficient solution for matrix chain multiplication on GPUs that minimizes computation as well as off-chip data transfers. For this, optimizations at three different levels are provided. For a single matrix multiplication, we use a blocking strategy that achieves the minimum number of global memory loads for a given amount of shared memory. We extend our approach to three matrices to decrease the data transfers even further. Finally, we use a parenthesizing algorithm that minimizes the number of computations as well as memory transfers for a whole sequence of matrices.
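    The single-multiplication level can be illustrated with a rough load-count model. This is a back-of-the-envelope sketch, not the paper's exact formulation: for a tiled C = A*B with tile sizes (tm, tn, tk), each A-tile is re-read once per column block of C and each B-tile once per row block, and we search tile shapes whose A- and B-tiles fit an assumed shared-memory budget (in elements):

    ```python
    def global_loads(M, N, K, tm, tn):
        """Approximate DRAM element loads for tiled C[M,N] = A[M,K] * B[K,N]."""
        col_blocks = -(-N // tn)   # ceil(N / tn): times A is streamed in
        row_blocks = -(-M // tm)   # ceil(M / tm): times B is streamed in
        return M * K * col_blocks + K * N * row_blocks

    def best_tiles(M, N, K, smem_elems, tk=8):
        """Search tile shapes fitting the shared-memory budget that minimize
        the modeled global load count; returns (loads, tm, tn)."""
        best = None
        for tm in range(8, 129, 8):
            for tn in range(8, 129, 8):
                if tm * tk + tk * tn <= smem_elems:   # A-tile + B-tile must fit
                    cand = (global_loads(M, N, K, tm, tn), tm, tn)
                    if best is None or cand < best:
                        best = cand
        return best
    ```

    The model makes the abstract's point concrete: global loads fall as 1/tm + 1/tn, so the largest square tiles that fit shared memory minimize off-chip traffic.
    
    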